06. Parallel Crawler

Step 4. Coding the Parallel Web Crawler

Luckily for you, the forward-thinking author of the legacy web crawler created an extensible design that uses a dependency injection framework. They defined an interface WebCrawler; the legacy code exists in the concrete implementation SequentialWebCrawler. Definitely not the worst legacy code you've ever seen! All you have to do is create a second implementation, ParallelWebCrawler, that has the same functionality as the legacy crawler, but takes advantage of multi-core architectures.

Your task is to fill in the src/main/java/com/udacity/webcrawler/ParallelWebCrawler.java file, specifically the crawl method:

public CrawlResult crawl(List<String> startingUrls)

The input is a list of URLs where the crawl should start downloading and parsing HTML. For this part, you can reuse the legacy code that downloads and parses web pages. @Inject a com.udacity.webcrawler.parser.PageParserFactory, and use it like this:

PageParser.Result result = parserFactory.get(url).parse();
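
If it helps to see the wiring, here is a minimal sketch of how the factory could be injected. The class name is a placeholder, and the real constructor in the starter code already declares other injected parameters that you should keep.

package com.udacity.webcrawler;

import com.udacity.webcrawler.parser.PageParserFactory;
import javax.inject.Inject; // match the @Inject import used elsewhere in the project

// Sketch only: ParallelWebCrawlerSketch is a placeholder name. In the real
// ParallelWebCrawler, add the factory alongside the constructor parameters
// that are already there.
final class ParallelWebCrawlerSketch {
  private final PageParserFactory parserFactory;

  @Inject
  ParallelWebCrawlerSketch(PageParserFactory parserFactory) {
    this.parserFactory = parserFactory;
  }
}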

The implementation of ParallelWebCrawler must actually run in parallel. A ForkJoinPool has already been created for you in the ParallelWebCrawler constructor.

Recall that ForkJoinPool is a kind of Executor that is optimized for efficient processing when most work tasks create other subtasks. In other words, it is ideal for building recursive algorithms that run in parallel. In fact, the most common kinds of subtasks are called RecursiveAction and RecursiveTask. Take a look at the legacy implementation (src/main/java/com/udacity/webcrawler/SequentialWebCrawler.java), and you will notice that it also uses a recursive algorithm: the crawlInternal() method invokes itself. Think about ways to take this recursive algorithm and use ForkJoinPool to make it parallel.

There are multiple correct ways to do this. For starters, you will want to create your own custom task class, which should be a subclass of RecursiveAction or RecursiveTask; either choice is fine, but depending on your decision the final code will be slightly different. Each custom task will download and process URLs, and the ForkJoinPool's worker threads run many of these tasks at the same time. Remember how ForkJoinPool works by having its subtasks create more subtasks? To do that, your custom task class will create other subtasks and run them using the invoke or invokeAll methods inherited from ForkJoinTask.
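
To make the shape of such a task concrete, here is a hedged sketch of a RecursiveAction subclass. CrawlTask and its fields are illustrative names, not part of the starter code; the parserFactory.get(url).parse() call is the one shown above, and the getWordCounts() and getLinks() calls mirror how the legacy crawler reads the parse result. The depth limit, deadline, and ignored-URL checks performed by the legacy crawlInternal() are omitted and still need to be added.

import com.udacity.webcrawler.parser.PageParser;
import com.udacity.webcrawler.parser.PageParserFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.RecursiveAction;

// Illustrative sketch: depth, deadline, and ignored-URL handling from the
// legacy crawlInternal() are not shown here.
final class CrawlTask extends RecursiveAction {
  private final String url;
  private final PageParserFactory parserFactory;
  private final Set<String> visitedUrls; // must be a concurrent set

  CrawlTask(String url, PageParserFactory parserFactory, Set<String> visitedUrls) {
    this.url = url;
    this.parserFactory = parserFactory;
    this.visitedUrls = visitedUrls;
  }

  @Override
  protected void compute() {
    // Claim the URL atomically; add() returns false if another task already claimed it.
    if (!visitedUrls.add(url)) {
      return;
    }
    PageParser.Result result = parserFactory.get(url).parse();
    // ...fold result.getWordCounts() into a shared concurrent map here...
    List<CrawlTask> subtasks = new ArrayList<>();
    for (String link : result.getLinks()) {
      subtasks.add(new CrawlTask(link, parserFactory, visitedUrls));
    }
    // invokeAll() forks each subtask onto the pool and waits for all of them to finish.
    invokeAll(subtasks);
  }
}

The crawl() method would then submit one such task per starting URL, for example with pool.invoke(task), and build the CrawlResult from the shared state once the tasks complete.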

If it makes sense for the way you structure your implementation, consider applying the factory pattern to create these subtasks. Aim to avoid scattering large constructor calls with long parameter lists throughout your code; the builder pattern can help with this.
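
For instance, a small builder like the following keeps task construction in one place. It is only a sketch built around the hypothetical CrawlTask above, and the setter names are placeholders.

import com.udacity.webcrawler.parser.PageParserFactory;
import java.util.Set;

// Hypothetical builder for the CrawlTask sketch above; extend it with whatever
// other fields your task actually needs (deadline, depth, word-count map, ...).
final class CrawlTaskBuilder {
  private String url;
  private PageParserFactory parserFactory;
  private Set<String> visitedUrls;

  CrawlTaskBuilder setUrl(String url) {
    this.url = url;
    return this;
  }

  CrawlTaskBuilder setParserFactory(PageParserFactory parserFactory) {
    this.parserFactory = parserFactory;
    return this;
  }

  CrawlTaskBuilder setVisitedUrls(Set<String> visitedUrls) {
    this.visitedUrls = visitedUrls;
    return this;
  }

  CrawlTask build() {
    return new CrawlTask(url, parserFactory, visitedUrls);
  }
}

A subtask can then be created with new CrawlTaskBuilder().setUrl(link).setParserFactory(parserFactory).setVisitedUrls(visitedUrls).build().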

Remember that in a parallel program, you can have multiple threads accessing data structures at the same time. Because of this, you should use one or more concurrent data structures to compute the output. Keep the following tips in mind:

  • For the urlsVisited output, the same URL should not be counted twice in the same crawl. Note: if a URL returns an HTTP error (such as a 400 error), the crawler still counts that as a "visit".

  • When computing the wordCounts, the crawler should not accidentally count the results from the same page twice. Remember that the crawler will be downloading and processing multiple web pages at the same time, so this could be tricky!

    Utilities like Collections.synchronizedCollection and java.util.concurrent will be helpful, but when using these data structures, think carefully about which methods are and are not atomic, and what guarantees those methods provide. One possible arrangement of the shared state is sketched just after this list.
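
Here is one such arrangement, as a sketch only: CrawlState and its method names are placeholders, and it assumes PageParser.Result exposes word counts as a Map<String, Integer>, as the legacy crawler reads them. The crawl() method would create one such object and share it with every task.

import com.udacity.webcrawler.parser.PageParser;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of shared crawl state.
final class CrawlState {
  // ConcurrentHashMap.newKeySet() returns a thread-safe set whose add() is atomic.
  private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();
  private final Map<String, Integer> wordCounts = new ConcurrentHashMap<>();

  /** Returns true only for the first thread to claim the URL, so no URL is counted twice. */
  boolean markVisited(String url) {
    return visitedUrls.add(url);
  }

  /** Folds one page's counts into the shared totals. */
  void recordWordCounts(PageParser.Result result) {
    for (Map.Entry<String, Integer> entry : result.getWordCounts().entrySet()) {
      // merge() performs the read-modify-write as a single atomic step; a separate
      // get() followed by put() could lose updates when two threads race.
      wordCounts.merge(entry.getKey(), entry.getValue(), Integer::sum);
    }
  }

  int urlsVisited() {
    return visitedUrls.size();
  }

  Map<String, Integer> wordCounts() {
    return wordCounts;
  }
}

Note that even with a synchronized or concurrent collection, a check-then-act sequence such as containsKey() followed by put() is not atomic as a whole; prefer single atomic operations like add(), merge(), or compute().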

Running Tests

As you work, you can run the provided unit tests on your parallel web crawler implementation:

mvn test -Dtest=WebCrawlerTest,ParallelWebCrawlerTest

These tests do not thoroughly check that the crawler uses correct synchronization, but they provide a useful signal as to whether ParallelWebCrawler matches the functionality of the legacy code. You will still need to review your implementation yourself to make sure it synchronizes access to shared data correctly.